Abstract

Recurrent Neural Networks (RNNs) with Long Short-Term
Memory (LSTM) units are widely used because they are
expressive and easy to train. Our interest lies in
empirically evaluating the expressiveness and the
learnability of LSTMs by training them to evaluate short
computer programs, a problem that has traditionally been
viewed as too complex for neural networks. We consider a
simple class of programs that can be evaluated with a
single left-to-right pass using constant memory. Our main
result is that LSTMs can learn to map the character-level
representations of such programs to their correct outputs.
Notably, it was necessary to use curriculum learning, and
while conventional curriculum learning proved ineffective,
we developed a new variant of curriculum learning that
improved our networks' performance in all experimental
conditions.



1. Introduction 

Execution of computer programs requires dealing with 
multiple nontrivial concepts. To execute a program, a sys- 
tem has to understand numerical operations, the branching 
of if-statements, the assignments of variables, the
compositionality of operations, and many more.

We show that Recurrent Neural Networks (RNNs) with
Long Short-Term Memory (LSTM) units can accurately
evaluate short simple programs. The LSTM reads the pro- 
gram character-by-character and computes the program's 
output. We considered a constrained set of computer pro- 
grams that can be evaluated in linear time and constant 
memory because the LSTM reads the program only once 
and its memory is small (Section 3). Indeed, the runtime of
the LSTM is linear in the size of the program, so it cannot
simulate programs that have a greater minimal runtime. 

It is difficult to train LSTMs to execute computer programs, 
so we used curriculum learning to simplify the learning 
problem. We design a curriculum procedure which outper- 
forms both conventional training that uses no curriculum 
learning (baseline) as well as naive curriculum learning 
(Bengio et al., 2009) (Section 4). We provide a plausible 
explanation for the effectiveness of our procedure relative 
to naive curriculum learning (Section 7). 

Finally, in addition to curriculum learning strategies, we 
examine two simple input transformations that further sim- 
plify the learning problem. We show that, in many cases, 
reversing the input sequence (Sutskever et al., 2014) and 
replicating the input sequence improve the LSTM's
performance on a memorization task (Section 3.1).

2. Related work 

There has been related research that used Tree Neural
Networks (sometimes known as Recursive Neural Networks)
to evaluate symbolic mathematical expressions and logical
formulas (Zaremba et al., 2014a; Bowman et al., 2014;
Bowman, 2013), which is close in spirit to our work.
However, Tree Neural Networks require parse trees, and in
the aforementioned work they process operations on the
level of words, so each operation is encoded with its index.
Computer programs are also more complex than mathematical
or logical expressions due to branching, looping, and vari- 
able assignment. 

From a methodological perspective, we formulate the pro- 
gram evaluation task as a language modeling problem 
on a sequence (Mikolov, 2012; Sutskever, 2013; Pascanu 
et al., 2013). Other interesting applications of recurrent 
neural networks include speech recognition (Robinson
et al., 1996; Graves et al., 2013), machine translation (Cho 
et al., 2014; Sutskever et al., 2014), handwriting recogni- 
tion (Pham et al., 2013; Zaremba et al., 2014b), and many 
more. 

Maddison & Tarlow (2014) learned a language model on
parse trees, and Mou et al. (2014) predicted whether two
programs are equivalent or not. Both of these approaches
require parse trees, while we learn from a character-level
sequence of the program.

Predicting program output requires that the model deals 
with long term dependencies that arise from variable as- 
signment. Thus we chose to use Recurrent Neural Net- 
works with Long Short Term Memory units (Hochreiter & 
Schmidhuber, 1997), although there are many other RNN 
variants that perform well on tasks with long term depen- 
dencies (Cho et al., 2014; Jaeger et al., 2007; Koutnik et al., 
2014; Martens, 2010; Bengio et al., 2013).

Initially, we found it difficult to train LSTMs to accurately 
evaluate programs. The compositional nature of computer 
programs suggests that the LSTM would learn faster if we 
first taught it the individual operators separately and then 
taught the LSTM how to combine them. This approach can 
be implemented with curriculum learning (Bengio et al., 
2009; Kumar et al., 2010; Lee & Grauman, 2011), which
prescribes gradually increasing the "difficulty level" of the 
examples presented to the LSTM, and is partially motivated 
by the fact that humans and animals learn much faster when
their instruction provides them with hard but manageable 
exercises. Unfortunately, we found the naive curriculum 
learning strategy of Bengio et al. (2009) to be generally 
ineffective and occasionally harmful. One of our key con- 
tributions is the formulation of a new curriculum learning 
strategy that substantially improves the speed and the qual- 
ity of training in every experimental setting that we consid- 
ered. 

3. Subclass of programs 

We train RNNs on a class of simple programs that can be
evaluated in O(n) time and constant memory. This
restriction is dictated by the computational structure of
the RNN itself, as it can only do a single pass over the
program using a very limited memory. Our programs use the
Python syntax and are based on a small number of oper- 
ations and their composition (nesting). We consider the 
following operations: addition, subtraction, multiplication, 
variable assignment, if-statements, and for-loops, although
we forbid double loops. Every program ends with a single 
"print" statement that outputs a number. Several example 
programs are shown in Figure 1.

We select our programs from a family of distributions
parameterized by length and nesting. The length parameter is
the number of digits in numbers that appear in the programs
(so the numbers are chosen uniformly from [1, 10^length]).
For example, the programs in Figure 1 are generated with
length = 4 (and nesting = 3).



Input:

j=8584
for x in range(8):
  j+=920
b=(1500+j)
print((b+7567))
Target: 25011.



Input:

i=8827
c=(i-5347)
print((c+8704) if 2641<8500 else 5308)
Target: 12184.



Figure 1. Example programs on which we train the LSTM. The 
output of each program is a single number. A "dot" symbol indi- 
cates the end of a number and has to be predicted as well. 
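
The targets in Figure 1 can be checked by running a program in an ordinary Python interpreter and appending the end-of-number dot. The following is a minimal sketch of that check; the helper name `target_for` is hypothetical and is not part of the training pipeline described in this paper.

```python
import contextlib
import io


def target_for(program: str) -> str:
    """Run a generated program and return its printed number plus the end dot.

    Hypothetical helper, for illustration only.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # The programs only use integers, assignment, if, for, and print.
        exec(program, {})
    return buffer.getvalue().strip() + "."


loop_program = "j=8584\nfor x in range(8):\n  j+=920\nb=(1500+j)\nprint((b+7567))"
print(target_for(loop_program))  # prints 25011.
```

The LSTM never sees this interpreter; it only observes the program text and has to emit the same characters, including the final dot.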



We are more restrictive with multiplication and the ranges
of for-loops, as these are much more difficult to handle.
We constrain one of the operands of multiplication and the
range of for-loops to be chosen uniformly from the much
smaller range [1, 4 · length]. This choice is dictated by the
limitations of our architecture. Our models are able to per- 
form linear-time computation while generic integer mul- 
tiplication requires superlinear time. Similar restrictions 
apply to for-loops, since nested for-loops can implement 
integer multiplication. 
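
The two operand ranges can be illustrated with a small sampler. This is a sketch only: it assumes the restricted range is [1, 4 · length], and the function names are hypothetical rather than taken from the authors' generator.

```python
import random


def sample_operand(length: int, rng: random.Random) -> int:
    """Ordinary operand: uniform over [1, 10**length]."""
    return rng.randint(1, 10 ** length)


def sample_small_operand(length: int, rng: random.Random) -> int:
    """Restricted operand (multiplier or loop range): uniform over [1, 4*length].

    The bound 4*length is our reading of the restricted range.
    """
    return rng.randint(1, 4 * length)


def sample_multiplication(length: int, rng: random.Random) -> str:
    """One multiplication expression with exactly one restricted operand."""
    small = sample_small_operand(length, rng)
    large = sample_operand(length, rng)
    return f"({small}*{large})"


rng = random.Random(0)
print(sample_multiplication(4, rng))
```

With length = 4 the restricted operand never exceeds 16, so a product or a loop adds at most a small constant factor to the work the network has to carry out.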

The nesting parameter is the number of times we are
allowed to combine the operations with each other. A higher
value of nesting results in programs with a deeper parse
tree. Nesting makes the task much harder for our LSTMs, 
because they do not have a natural way of dealing with 
compositionality, in contrast to Tree Neural Networks. It 
is surprising that they are able to deal with nested expres- 
sions at all. 

It is important to emphasize that the LSTM reads the input 
one character at a time and produces the output character 
by character. The characters are initially meaningless from 
the model's perspective; for instance, the model does not 
know that "+" means addition or that 6 is followed by 7. 
Indeed, scrambling the input characters (e.g., replacing "a"
with "q", "b" with "w", etc.) would have no effect on the
model's ability to solve this problem. We demonstrate the 
difficulty of the task by presenting an input-output example 
with scrambled characters in Figure 2. 

Input:

vqppkn 
sqdvf 1 jmnc 

y2vxdddsepnimcbvubkomhrpliibtwztbl j ipcc 
Target: hkhpg 



Figure 2. An example program with scrambled characters. It 
helps illustrate the difficulty faced by our neural network. 
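
Scrambling of this kind is a one-to-one relabeling of the input alphabet, so it removes the meaning a human reader attaches to each symbol without removing any information. A minimal sketch follows; the alphabet, the seed, and the function names are assumptions made for illustration.

```python
import random
import string


def make_scrambler(alphabet: str, seed: int) -> dict:
    """Build a random one-to-one relabeling of the characters in `alphabet`."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def scramble(text: str, table: dict) -> str:
    """Relabel every character; characters outside the table are left as-is."""
    return "".join(table.get(ch, ch) for ch in text)


# Assumed alphabet: lowercase letters, digits, and a few operator symbols.
alphabet = string.ascii_lowercase + string.digits + "+-*=<>():"
table = make_scrambler(alphabet, seed=0)
program = "i=8827\nc=(i-5347)"
scrambled = scramble(program, table)
inverse = {new: old for old, new in table.items()}
restored = scramble(scrambled, inverse)  # applying the inverse recovers the program
```

Because the mapping is invertible, a model trained on scrambled text faces exactly the same learning problem as one trained on the original characters.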

3.1. Memorization Task 

In addition to program evaluation, we also investigate the 
task of memorizing a random sequence of numbers. Given 
an example input 123456789, the LSTM reads it one char- 
acter at a time, stores it in memory, and then outputs 
123456789 one character at a time. We present and ex- 
plore two simple performance enhancing techniques: input 
reversing (from Sutskever et al. (2014)) and input doubling. 

The idea of input reversing is to reverse the order of the 
input (987654321) while keeping the desired output un- 
changed (123456789). It seems to be a neutral operation as 
the average distance between each input and its correspond- 
ing target did not become shorter. However, input reversing 
introduces many short term dependencies that make it eas- 
ier for the LSTM to start making correct predictions. This 
strategy was first introduced for LSTMs for machine trans- 
lation by Sutskever et al. (2014). 

The second performance enhancing technique is input dou- 
bling, where we present the input sequence twice (so the 
example input becomes 123456789; 123456789), while the 
output is unchanged (123456789). This method is
meaningless from a probabilistic perspective, as RNNs
approximate the conditional distribution p(y|x), yet here we
attempt to learn p(y|x,x). Still, it gives noticeable
performance improvements. By processing the input several
times before producing an output, the LSTM is given the 
opportunity to correct the mistakes it made in the earlier 
passes. 
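
Both transformations act only on the input side of a training pair. The sketch below shows one way to build such pairs; the separator character and the choice to reverse before doubling when both are enabled are assumptions, and the function name is hypothetical.

```python
def make_memorization_example(digits: str, reverse: bool = False,
                              double: bool = False, separator: str = ";"):
    """Return an (input, target) pair for the memorization task.

    The target is always the original sequence; only the input is modified.
    """
    source = digits[::-1] if reverse else digits
    if double:
        source = source + separator + source  # present the same input twice
    return source, digits


plain = make_memorization_example("123456789")
reversed_pair = make_memorization_example("123456789", reverse=True)
doubled_pair = make_memorization_example("123456789", double=True)
```

Here `reversed_pair` is `("987654321", "123456789")`: the last input character is now adjacent to the first output character it determines, which is the short-term dependency that reversal introduces.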

4. Curriculum Learning 

Our program generation scheme is parametrized by length
and nesting. These two parameters allow us to control the
complexity of the program. When length and nesting are
large enough, the learning problem becomes nearly
intractable. This indicates that in order to learn to
evaluate programs of a given length = a and nesting = b, it
may help to first learn to evaluate programs with
length ≪ a and nesting ≪ b.
We compare the following curriculum learning strategies:
We compare the following curriculum learning strategies: 

No curriculum learning (baseline)

The baseline approach does not use curriculum learning.
This means that we generate all the training samples with
length = a and nesting = b. This strategy is the most
"sound" from a statistical perspective, as it is generally
recommended to make the training distribution identical to
the test distribution.

Naive curriculum strategy (naive) 

We begin with length = 1 and nesting = 1. Once learning 
stops making progress, we increase length by 1. We repeat 
this process until length reaches a, at which point we
increase nesting by one and reset length to 1. 

We can also choose to first increase nesting and then length. 
However, it does not make a noticeable difference in
performance. We skip this option in the rest of the paper,
and increase length first in all our experiments. This
strategy has been examined in previous work on curriculum
learning (Bengio et al., 2009). However, we show that it
often gives even worse performance than the baseline.

Mixed strategy (mix) 

To generate a random sample, we first pick a random length 
from [1, a] and a random nesting from [1, b] independently
for every sample. The Mixed strategy uses a balanced mix- 
ture of easy and difficult examples, so at any time during 
training, a sizable fraction of the training samples will have 
the appropriate difficulty for the LSTM. 

Combining the mixed strategy with naive curriculum 
strategy (combined) 

This strategy combines the mix strategy with the naive 
strategy. In this approach, every training case is obtained 
either by the naive strategy or by the mix strategy. As a 
result, the combined strategy always exposes the network 
at least to some difficult examples, which is the key way in 
which it differs from the naive curriculum strategy. We no- 
ticed that it reliably outperformed the other strategies in our 
experiments. We explain why our new curriculum learning 
strategies outperform the naive curriculum strategy in Sec- 
tion 7. 

We evaluate these four strategies on the program evaluation 
task (Section 6.1) and on the memorization task (Section 
6.2). 
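
The four strategies differ only in how the difficulty of each training sample is chosen. The sketch below makes that explicit for a target length a and a target nesting b. It is an illustration under stated assumptions: the function names are hypothetical, and the probability `p_naive` with which the combined strategy draws from the naive schedule is not specified in the text and is set to 0.5 here.

```python
import random


def naive_stages(a: int, b: int):
    """Yield the (length, nesting) stages of the naive curriculum in order.

    Length grows from 1 to a; then nesting grows by one and length resets.
    """
    for nesting in range(1, b + 1):
        for length in range(1, a + 1):
            yield (length, nesting)


def sample_difficulty(strategy: str, a: int, b: int, stage, rng: random.Random,
                      p_naive: float = 0.5):
    """Pick (length, nesting) for one training sample.

    `stage` is the current naive-curriculum stage, advanced elsewhere once
    learning stops making progress.
    """
    if strategy == "baseline":
        return (a, b)
    if strategy == "naive":
        return stage
    if strategy == "mix":
        return (rng.randint(1, a), rng.randint(1, b))
    if strategy == "combined":
        if rng.random() < p_naive:  # assumed mixing proportion
            return stage
        return (rng.randint(1, a), rng.randint(1, b))
    raise ValueError(f"unknown strategy: {strategy}")


rng = random.Random(0)
stages = list(naive_stages(2, 2))  # [(1, 1), (2, 1), (1, 2), (2, 2)]
sample = sample_difficulty("combined", 4, 3, stages[0], rng)
```

Under `"combined"`, even the first stage occasionally produces samples at full length and nesting, which is the property that separates it from the naive curriculum.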

5. RNN with LSTM cells 

In this section we briefly describe the deep LSTM
(Section 5.1). All vectors are n-dimensional unless
explicitly stated otherwise. Let $h_t^l \in \mathbb{R}^n$ be
a hidden state in layer $l$ in timestep $t$. Let
$T_{n,m} : \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a biased
linear mapping ($x \rightarrow Wx + b$ for some $W$ and $b$).
We let $\odot$ be element-wise multiplication and let $h_t^0$
be the input at timestep $t$. We use the activations at the
top layer $L$ (namely $h_t^L$) to predict $y_t$, where $L$ is
the depth of our LSTM.

Figure 3. A graphical representation of the LSTM memory cells
used in this paper (they differ in minor ways from Graves (2013)). 

5.1. Long-short term memory units 

The structure of the LSTM allows it to learn on prob- 
lems with long term dependencies relatively easily. The 
"long term" memory is stored in a vector of memory cells 
$c_t^l \in \mathbb{R}^n$. Although many LSTM architectures differ in
their connectivity structure and activation functions, all 
LSTM architectures have memory cells that are suitable for 
storing information for long periods of time. We used an 
LSTM described by the following equations (from Graves 
et al. (2013)):
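
In the notation of Section 5, a standard way to write this LSTM is given below. It is a reconstruction that assumes the common gate ordering (input gate $i$, forget gate $f$, output gate $o$, candidate $g$) and a single biased linear map $T_{2n,4n}$ applied to the stacked inputs.

```latex
\begin{align*}
\text{LSTM}: \; h_t^{l-1},\, h_{t-1}^{l},\, c_{t-1}^{l} &\rightarrow h_t^{l},\, c_t^{l} \\
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} &=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix} \\
c_t^{l} &= f \odot c_{t-1}^{l} + i \odot g \\
h_t^{l} &= o \odot \tanh(c_t^{l})
\end{align*}
```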

In these equations, the nonlinear functions sigm and tanh
are applied elementwise. Figure 3 shows the LSTM equa- 
tions (the figure is taken from Zaremba et al. (2014b)). 
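
For readers who prefer code, one cell update for a single layer and timestep can be sketched in plain Python. This assumes the standard gate structure (input, forget, and output gates with sigm, and a tanh candidate) and a weight layout in which the rows of `W` are ordered i, f, o, g; the function names and the layout are illustrative choices, not the authors' implementation.

```python
import math


def sigm(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Biased linear map x -> Wx + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def lstm_step(W, b, h_below, h_prev, c_prev):
    """One LSTM update: W has 4n rows and 2n columns, b has 4n entries."""
    n = len(h_prev)
    z = affine(W, b, h_below + h_prev)          # map the stacked inputs to 4n values
    i = [sigm(v) for v in z[0:n]]               # input gate
    f = [sigm(v) for v in z[n:2 * n]]           # forget gate
    o = [sigm(v) for v in z[2 * n:3 * n]]       # output gate
    g = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate cell values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so the memory cell is simply halved.
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0, 0.0, 0.0, 0.0]
h, c = lstm_step(W, b, h_below=[0.0], h_prev=[0.0], c_prev=[2.0])
```

The additive form of the cell update is what lets information persist across many timesteps: when the forget gate is near 1 and the input gate near 0, `c` is copied forward almost unchanged.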

6. Experiments 

In this section, we report the results of our curriculum 
learning strategies on the program evaluation and mem- 
orization tasks. In both experiments, we used the same 
LSTM architecture. 

Our LSTM has two layers and is unrolled for 50 steps in
both experiments. It has 400 units per layer and its
parameters are initialized uniformly in [-0.08, 0.08]. We
initialize the hidden states to zero. We then use the final
hidden states of the current minibatch as the initial hidden
state of the subsequent minibatch. The size of the minibatch
is 100. We clip the norm of the gradients (normalized by
minibatch size) at 5 (Mikolov et al., 2010). We keep the
learning rate equal to 0.5 until we reach the target length
and nesting (we only vary the length, i.e., the number of
digits, in the memorization task). After reaching the target
accuracy, we decrease the learning rate by 0.8. We then keep
the learning rate at the same level until there is no
improvement on the training set, at which point we decrease
it again. We begin training with length = 1 and
nesting = 1 (or length = 1 for the memorization task).
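
Two of these training details are easy to state in code. The sketch below shows gradient-norm clipping at 5 and one learning-rate decay step. It assumes the gradient has already been averaged over the minibatch, and it reads "decrease the learning rate by 0.8" as multiplication by 0.8, which is an interpretation rather than something the text states; the function names are hypothetical.

```python
import math


def clip_gradient(grad, max_norm: float = 5.0):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def decayed_learning_rate(rate: float, factor: float = 0.8) -> float:
    """Assumed reading of the decay rule: multiply the rate by 0.8."""
    return rate * factor


clipped = clip_gradient([30.0, 40.0])  # norm 50 is rescaled to norm 5
rate = decayed_learning_rate(0.5)      # first decay step from the initial rate
```

Clipping rescales the whole gradient vector, so its direction is preserved and only the step size is bounded.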

To prevent training samples from being repeated in the test 
set, we enforced that the training, validation, and test sets 
are disjoint. 
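
One simple way to obtain disjoint sets is to remove exact duplicates from the generated programs before splitting them. The sketch below illustrates that idea; it is an assumption about how disjointness could be enforced, not a description of the procedure actually used, and the function name is hypothetical.

```python
def disjoint_splits(samples, n_valid: int, n_test: int):
    """Split program strings into train/validation/test with no shared items."""
    unique = list(dict.fromkeys(samples))  # drop exact duplicates, keep first occurrence
    if n_valid + n_test > len(unique):
        raise ValueError("not enough distinct samples for the requested splits")
    test = unique[:n_test]
    valid = unique[n_test:n_test + n_valid]
    train = unique[n_test + n_valid:]
    return train, valid, test


programs = ["print(1)", "print(2)", "print(1)", "print(3)", "print(4)", "print(2)"]
train, valid, test = disjoint_splits(programs, n_valid=1, n_test=1)
```

Deduplication matters most at small length and nesting, where the space of distinct programs is small and random generation repeats itself often.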

6.1. Program Evaluation Results 

We train our LSTMs using the four strategies described in 
Section 4: 

• No curriculum learning (baseline), 

• Naive curriculum strategy (naive),

• Mixed strategy (mix), and 

• Combined strategy (combined). 

Figure 4 shows the absolute performance of the baseline 
strategy (training using target test data distribution), and of 
the best performing strategy, combined. Moreover, Figure 
5 shows the performance of all strategies relative to base- 
line. Finally, we provide several example predictions on 
test data in Figure 6. 



[Figure 4: two heatmap panels, "Baseline" strategy and "Combined" strategy, showing accuracy by length and nesting.]



Figure 4. Absolute prediction accuracy of the baseline strategy 
and of the combined strategy (see Section 4) on the program eval- 
uation task. Deeper nesting and larger length make the task more 
difficult. Overall, the combined strategy outperformed the base- 
line strategy in every setting. 



6.2. Memorization Results 

Recall that the task is to copy a sequence of input values. 
Namely, given an input such as 123456789, the goal is to 
produce the output 123456789. The model accesses one 
input character at the time and has to produce the output 

[Figure 5: three panels, "Naive", "Mix", and "Combined" strategies, each relative to the "Baseline"; axis: nesting.]



Figure 5. Relative prediction accuracy of the different strategies 
with respect to the baseline strategy. The naive curriculum
strategy was found to sometimes perform worse than the
baseline. A possible explanation is provided in Section 7.
The combined strategy
outperforms all other strategies in every configuration. 



only after holding the entire input in its memory. This task 
gives us insights into the LSTM's ability to memorize and 
remember information. We have evaluated our model on 
sequences of lengths ranging from 5 to 65. We use the 
four curriculum strategies of Section 4. In addition, we 
investigate two strategies to modify the input which boost 
performance: 



Input:

f=(8794 if 8887<9713 else (3*8334))
print((f+574))
Target: 9368.
Model prediction: 9368.



Input:

j=8584
for x in range(8):
  j+=920
b=(1500+j)
print((b+7567))
Target: 25011.
Model prediction: 23011.



Input:

c=445
d=(c-4223)
for x in range(1):
  d+=5272
print((8942 if d<3749 else 2951))
Target: 8942.
Model prediction: 8942.



• inverting input (Sutskever et al., 2014) 

• doubling input 

Both strategies are described in Section 3.1. Figure 7 shows 
the absolute performance of the baseline strategy and of 
the combined strategy. It is clear that the combined strat- 
egy outperforms every other strategy. Each graph contains 
4 settings, which correspond to the possible combinations 
of input inversion and input doubling. The result clearly 
shows that simultaneously doubling and reversing the
input achieves the best results. 

7. Hidden State Allocation Hypothesis 

Our experimental results suggest that a proper curriculum 
learning strategy is critical for achieving good performance 
on very hard problems where conventional stochastic gra- 
dient descent (SGD) performs poorly. The results on both 



Input:

a=1027
for x in range(2):
  a+=(402 if 6358>8211 else 2158)
print(a)
Target: 5343.
Model prediction: 5293.



Figure 6. Sample predictions generated by our model trained with 
the combined strategy. Here length is 4 and nesting is 3. The 
model makes interesting mistakes. For instance, it occasionally 
makes off-by-one errors yet it has no built-in notion of integer- 
order. 

Figure 7. Prediction accuracy on the memorization task for the
four curriculum strategies. The input length ranges from 5 to 
65 digits. Every strategy is evaluated with the following 4 in- 
put modification schemes: no modification; input inversion; input 
doubling; and input doubling and inversion. Training time was 
limited to 20 epochs. 



of our problems (Sections 6.2 and 6.1) show that the com- 
bined strategy is better than all other curriculum strategies, 
including both naive curriculum learning, and training on 
the target distribution. We have a plausible explanation for 
why this is the case. 

It seems natural to train models with examples of increas- 
ing difficulty. This way the models have a chance to learn 
the proper intermediate concepts and input-output map- 
ping, and then utilize them for the more difficult problem 
instances. Learning the target task might be just too diffi- 
cult with SGD from a random parameter initialization. This 
explanation has been proposed in previous work on cur- 
riculum learning (Bengio et al., 2009). However, based on 
empirical results, the naive strategy of curriculum learning 
can sometimes be worse than learning with just the
target distribution. 

In our tasks, the neural network has to perform a lot of 
memorization. The easier examples usually require less 
memorization than the hard examples. For instance, in order
to add two 5-digit numbers, one has to remember at
least 5 digits before producing any output. The best way 
to accurately memorize 5 numbers could be to spread them 
over the entire hidden state / memory cell (i.e., use a dis- 
tributed representation). Indeed, the network has no incen- 
tive to utilize only a fraction of its state. It is always best 
to make use of its entire memory capacity. This implies 
that the harder examples would require a restructuring of 
its memory patterns. It would need to contract its repre- 
sentations of 5 digit numbers in order to free space for the 
6-th number. This process of memory pattern restructur- 
ing might be difficult to achieve, so it could be the reason 
for the relatively poor performance of the naive curriculum 
learning strategy (relative to baseline). 

The combined strategy avoids the problem of abruptly
restructuring memory patterns. It is a mixture of the
naive curriculum learning strategy and a balanced mixture
of examples of all difficulties. The examples produced by
the naive curriculum strategy help to learn the intermedi- 
ate input-output mapping, which is useful for solving the 
target task. The extra samples of all difficulties prevent the 
network from utilizing all the memory on the easy exam- 
ples, thus eliminating the need to restructure the memory 
patterns. 

8. Critique 

Perfect prediction of program output requires an exact un- 
derstanding of all operands and concepts. However, imper- 
fect prediction might be achieved in a multitude of ways, 
and could heavily rely on memorization, without a genuine 
understanding of the underlying concepts. For instance, 
perfect addition is relatively intricate, as the LSTM needs 
to know the order of numbers and to correctly compute the
carry. 

There are many alternatives to the addition algorithm if
perfect output is not required. For instance, one can
perform element-wise addition, and as long as there is no
carry the output would be perfectly correct. Another
alternative, which requires more memory but is also
simpler, is to memorize all results of addition for 2-digit
numbers. Then multi-digit addition can be broken down into
multiple 2-digit additions performed element-wise. Once
again, such an algorithm would have a reasonably high
prediction accuracy, although it would be far from correct.
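
The element-wise shortcut is easy to make concrete. The sketch below adds decimal digits independently and discards every carry; the function name is hypothetical and the code only illustrates the kind of imperfect algorithm discussed here, not what the trained LSTM actually computes.

```python
def digitwise_add_no_carry(a: int, b: int) -> int:
    """Add two non-negative integers digit by digit, dropping all carries."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a //= 10
        b //= 10
        place *= 10
    return result


no_carry = digitwise_add_no_carry(1234, 4321)    # 5555, equal to the true sum
with_carry = digitwise_add_no_carry(8584, 920)   # 8404, while the true sum is 9504
```

On the second pair the shortcut still gets the units digit right, so a character-level accuracy measure would give it partial credit even though the answer is wrong.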

We do not know how heavily our model relies on memo- 
rization and how far the learnt algorithm is from the actual, 
correct algorithm. This could be tested by creating a big 
discrepancy between the training and test data, but in this 
work, the training and the test distributions are the same. 
We plan to examine how well our models would generalize 
on very different new examples in future work. 

9. Discussion 

We have shown that it is possible to learn to evaluate
programs with limited prior knowledge. This work
demonstrates the power and expressiveness of LSTMs. We also
showed that proper curriculum learning is crucial for get- 
ting good results on very difficult tasks that cannot be opti- 
mized with conventional SGD. 

We also found that the general method of doubling the input 
reliably improves the performance of LSTMs. 

Our results are encouraging but they leave many questions 
open, such as learning to execute generic programs (e.g., 
ones that run in more than O(n) time). This cannot be
achieved with conventional RNNs or LSTMs due to their 
runtime restrictions. We also do not know the optimal cur- 
riculum learning strategy. To understand that, we may need 
to identify those training samples that are most beneficial 
to the model. 